Powerlifting Competitions
Overview and Motivation
As we are both interested in sports, especially weightlifting, we would like to know more about powerlifting and the population of powerlifters in general. We also want to see where the most powerlifting competitions take place, and whether competitions and/or performance have any impact on the social media following of powerlifters. Since social media is a recent technology, and even a new market with new jobs, we thought it would be interesting to see whether success on Instagram, for example, is explained by real performance in the sport, or rather by pictures, attractiveness or marketing techniques. More generally, we are interested in seeing whether there is a strong relationship between bodyweight and strength and whether strength can be predicted from a set of features (age, gender, equipment, ...), and in diving into powerlifting data to learn more about a discipline we like.
Initial Questions:
Our initial questions were:
1) To graphically summarize the population of powerlifters and draw meaningful conclusions, using features such as gender, bodyweight, height and age, and to try to predict performance (best squat, best deadlift and best bench press).
2) It would also be interesting to measure the physical statistics of the lifters (age, sex, weight) and see if there is a link with their Wilks points.
3) We could also see if the weight/competition categories as they are today are coherent for comparing lifters, or if other divisions would make more sense.
4) We wanted to link federations and performances in each exercise, and see if some federations attract a certain type of powerlifter.
5) Then, we would also analyze the evolution of overall performance through time for a given lifter and see if any unexpected (or expected) pattern occurs. We would select lifters who have more than a certain number of entries, but we had not yet decided on this threshold.
6) It would be interesting to extract the number of Instagram followers per powerlifter, and see if we can draw conclusions about the links between popularity, performances, federations and exercises.
7) Finally, we wanted to make a visual representation, in the form of a map, of the different powerlifting contests. One of the tables contains the places of the competitions. With the help of OpenStreetMap, we planned to extract the coordinates of those competitions and represent each of them as a point on a world map. This would let us see the density of contests and the most popular places and countries to compete in.
During the exploratory analysis, we saw that question 3) cannot be answered as such. Indeed, lifters adapt to the constraint of the weight divisions and try to reach the heaviest possible weight within their division. However, this raised the following question:
3bis) Is it really an advantage to be the heaviest in one's competing division? We restrict this to the most popular divisions, as divisions vary a lot from one competition or federation to another.
When working further on question 5), we found the following:
- Few powerlifters competed often enough for their progression to be followed thoroughly.
- They do not compete in the same years, for the same duration or at the same frequency, and they do not perform the same exercises.
Therefore, we could not compare the powerlifters with one another to build a general model, and we did not have enough data on any specific powerlifter to build a personal model.
Instead, we decided to analyse the 5 powerlifters who competed the most and see if we can deduce anything from this analysis.
Question 5bis) therefore becomes: we analyze the evolution of overall performance through time for the 5 lifters who competed the most, and see if we can deduce something about the evolution of their performance.
The final list of questions is therefore:
1) To graphically summarize the population of powerlifters and draw meaningful conclusions, using features such as gender, bodyweight, height and age, and to try to predict performance (best squat, best deadlift and best bench press).
2) It would also be interesting to measure the physical statistics of the lifters (age, sex, weight) and see if there is a link with their Wilks points.
3) Is it really an advantage to be the heaviest in one's competing division? We restrict this to the most popular divisions, as divisions vary a lot from one competition or federation to another.
4) We wanted to link federations and performances in each exercise, and see if some federations attract a certain type of powerlifter.
5) Then, we would also analyze the evolution of overall performance through time for the 5 lifters who competed the most, and see if we can deduce something about the evolution of their performance.
6) It would be interesting to extract the number of Instagram followers per powerlifter, and see if we can draw conclusions about the links between popularity, performances, federations and exercises.
7) Finally, we wanted to make a visual representation, in the form of a map, of the different powerlifting contests. One of the tables contains the places of the competitions. With the help of OpenStreetMap, we planned to extract the coordinates of those competitions and represent each of them as a point on a world map. This would let us see the density of contests and the most popular places and countries to compete in.
Data
A short explanation
Squat & bench & deadlift
In this section, we explain the different lifts discussed in this work. The 3 main lifts performed during a powerlifting competition are:
- squat: a lift involving a squat done while holding a barbell on the shoulders. This exercise mainly targets the quadriceps, hamstrings and glutes, with the abdominals and back working for stabilization.
- bench press: a lift in which a weight is raised by extending the arms upward while lying on a bench. This lift works the pectorals, the deltoids and the triceps.
- deadlift: a lift in which the weight is lifted from the floor to hip level. This movement works the posterior chain (trapezius, latissimus dorsi, glutes, hamstrings).
Wilks coefficient
The Wilks coefficient, or Wilks formula, is used to compare the strength of powerlifters with one another despite their different bodyweights. A lifter's Wilks points are the total weight lifted multiplied by the coefficient, which is given by the following formula, where x is the lifter's bodyweight in kilograms:
\[Coeff = \frac{500}{a + bx + cx^2 + dx^3 + ex^4 + fx^5}\]
Coefficients
| Parameter | Men | Women |
|---|---|---|
| a | -216.0475144 | 594.31747775582 |
| b | 16.2606339 | -27.23842536447 |
| c | -0.002388645 | 0.82112226871 |
| d | -0.00113732 | -0.00930733913 |
| e | 7.01863E-06 | 4.731582E-05 |
| f | -1.291E-08 | -9.054E-08 |
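As an illustration, the formula and coefficients above can be evaluated with a few lines of Python. This is only a sketch: the original analysis was written in R, and the names `WILKS_COEFFS`, `wilks_coefficient` and `wilks_points` are ours. The example uses the first row of the raw lifter data (female, 59.6 kg bodyweight, 138.3 kg total), for which the dataset reports 155.1 Wilks points; the sketch should land close to that value, up to rounding of the recorded bodyweight.

```python
# Wilks coefficients (a, b, c, d, e, f) copied from the table above, keyed by sex.
WILKS_COEFFS = {
    "M": (-216.0475144, 16.2606339, -0.002388645,
          -0.00113732, 7.01863e-06, -1.291e-08),
    "F": (594.31747775582, -27.23842536447, 0.82112226871,
          -0.00930733913, 4.731582e-05, -9.054e-08),
}


def wilks_coefficient(bodyweight_kg, sex):
    """Return 500 / (a + b*x + c*x^2 + d*x^3 + e*x^4 + f*x^5) for bodyweight x in kg."""
    denominator = sum(coeff * bodyweight_kg ** power
                      for power, coeff in enumerate(WILKS_COEFFS[sex]))
    return 500.0 / denominator


def wilks_points(total_kg, bodyweight_kg, sex):
    """Wilks points: the total lifted (kg) scaled by the Wilks coefficient."""
    return total_kg * wilks_coefficient(bodyweight_kg, sex)


# First row of the raw lifter data: F, 59.6 kg bodyweight, 138.3 kg total.
print(round(wilks_points(138.3, 59.6, "F"), 1))
```

Because the coefficient decreases as bodyweight increases over the usual range, two lifters with the same total get different Wilks points, the lighter one scoring higher.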
Meetings
Importing the data
The second table is the one with the meetings. It has 8482 entries, and each entry has a MeetID (the same identifier as in the powerlifters' table), the meet path, the federation of the meet, the date of the meet, the country, state and town of the meet, and finally the meet name.
From this table, we want to use the MeetID, the date, the country, the federation, the state, the town and the name of each meeting.
| MeetID | MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName |
|---|---|---|---|---|---|---|---|
| 0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Championships |
| 1 | 365strong/1602 | 365Strong | 2016-11-19 | USA | MO | Ozark | Thanksgiving Powerlifting Classic |
| 2 | 365strong/1603 | 365Strong | 2016-07-09 | USA | NC | Charlotte | Charlotte Europa Games |
| 3 | 365strong/1604 | 365Strong | 2016-06-11 | USA | SC | Rock Hill | Carolina Cup Push Pull Challenge |
| 4 | 365strong/1605 | 365Strong | 2016-04-10 | USA | SC | Rock Hill | Eastern USA Challenge |
| 5 | 365strong/1701 | 365Strong | 2017-04-22 | USA | NC | Charlotte | Charlotte Europa Games |
Clean the data
We format the dates as year-month-day and convert the meet towns to character strings.
We keep the complete rows of the table and create two columns: one with the meet town trimmed of surrounding white space, and one with the location, obtained by appending the country to the meet town. This last column is used later when computing the world and US maps.
| MeetID | MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName | location |
|---|---|---|---|---|---|---|---|---|
| 0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Championships | Charlotte, USA |
| 1 | 365strong/1602 | 365Strong | 2016-11-19 | USA | MO | Ozark | Thanksgiving Powerlifting Classic | Ozark, USA |
| 2 | 365strong/1603 | 365Strong | 2016-07-09 | USA | NC | Charlotte | Charlotte Europa Games | Charlotte, USA |
| 3 | 365strong/1604 | 365Strong | 2016-06-11 | USA | SC | Rock Hill | Carolina Cup Push Pull Challenge | Rock Hill, USA |
| 4 | 365strong/1605 | 365Strong | 2016-04-10 | USA | SC | Rock Hill | Eastern USA Challenge | Rock Hill, USA |
| 5 | 365strong/1701 | 365Strong | 2017-04-22 | USA | NC | Charlotte | Charlotte Europa Games | Charlotte, USA |
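The cleaning steps described above could look roughly like the following Python sketch. It is not the original R code: the helper names `is_missing` and `clean_meet` are hypothetical, a "complete" row is assumed to mean that no column is empty or "NA", and "trimmed" is assumed to mean stripping leading and trailing white space.

```python
from datetime import datetime


def is_missing(value):
    """Treat None, empty strings and the literal "NA" as missing (assumption)."""
    return value is None or str(value).strip() in ("", "NA")


def clean_meet(row):
    """Return a cleaned copy of one meet record, or None if the row is incomplete."""
    if any(is_missing(value) for value in row.values()):
        return None  # keep complete rows only
    cleaned = dict(row)
    # Parse the date as year-month-day.
    cleaned["Date"] = datetime.strptime(row["Date"].strip(), "%Y-%m-%d").date()
    # Store the town as a string without surrounding white space.
    cleaned["MeetTown"] = str(row["MeetTown"]).strip()
    # Town plus country, used later to look up coordinates.
    cleaned["location"] = f'{cleaned["MeetTown"]}, {str(row["MeetCountry"]).strip()}'
    return cleaned


raw = {
    "MeetID": "0", "MeetPath": "365strong/1601", "Federation": "365Strong",
    "Date": "2016-10-29", "MeetCountry": "USA", "MeetState": "NC",
    "MeetTown": " Charlotte ",
    "MeetName": "2016 Junior & Senior National Powerlifting Championships",
}
print(clean_meet(raw)["location"])  # prints "Charlotte, USA"
```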
Powerlifters
Data source: Kaggle OpenPowerlifting 2018 data:
https://www.kaggle.com/dansbecker/powerlifting-database.
Data importation
The powerlifter data consists of two tables. The first one contains only the lifters' data. It is a CSV file with 386,414 entries. Each entry has a MeetID, a code identifying a specific competition, followed by the lifter's name, sex, equipment used, age, division (competition category), bodyweight in kg, weight class in the competition, fourth squat attempt, best squat, fourth bench attempt, best bench, fourth deadlift attempt, best deadlift, total kg lifted during the meet, place (ranking) and Wilks points. From this table, we intend to use the MeetID, name, sex, bodyweight, weight class, equipment, best squat, best bench, best deadlift, place (ranking) and Wilks. On first inspection of this table we see some implicit NAs, so the first thing we do when importing the table is to replace them with explicit NAs.
Below is a view of the raw data with explicit NAs:
| MeetID | Name | Sex | Equipment | Age | Division | BodyweightKg | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angie Belk Terry | F | Wraps | 47 | Mst 45-49 | 59.6 | 60 | NA | 47.6 | NA | 20.4 | NA | 70.3 | 138.3 | 1 | 155.1 |
| 0 | Dawn Bogart | F | Single-ply | 42 | Mst 40-44 | 58.5 | 60 | NA | 142.9 | NA | 95.2 | NA | 163.3 | 401.4 | 1 | 456.4 |
| 0 | Dawn Bogart | F | Single-ply | 42 | Open Senior | 58.5 | 60 | NA | 142.9 | NA | 95.2 | NA | 163.3 | 401.4 | 1 | 456.4 |
| 0 | Dawn Bogart | F | Raw | 42 | Open Senior | 58.5 | 60 | NA | NA | NA | 95.2 | NA | NA | 95.2 | 1 | 108.3 |
| 0 | Destiny Dula | F | Raw | 18 | Teen 18-19 | 63.7 | 67.5 | NA | NA | NA | 31.8 | NA | 90.7 | 122.5 | 1 | 130.5 |
| 0 | Courtney Norris | F | Wraps | 28 | Open Senior | 62.4 | 67.5 | -183.7 | 170.1 | NA | 77.1 | NA | 145.2 | 392.4 | 1 | 424.4 |
Cleaning of the table Powerlifters
We then clean this table. We convert the columns Sex and Equipment to factors. We remove the rows where no bodyweight is entered, since even for the same powerlifter we cannot extrapolate a bodyweight from a previous measurement. We keep the powerlifters who are older than 17 and younger than 75. The lower limit is linked to the legal age for competing in most countries: because many competitions do not accept very young lifters, the data below 17 is too sparse to give a good view. Moreover, at that age, variation in height, weight, etc. is too closely tied to growth to be attributed to training. The upper limit is set because measurements become scarce after the age of 75, even more visibly so for female powerlifters. Finally, in order to have an age for each row of the table, we used the following procedure:
- We extract the rows that do not have an age entered
- We extract the rows that do have an age entered
- We look at the names in common between the two extractions
- The names that appear in the table with no age but not in the table with age are discarded, as we have no way to know their age.
- For the names in both tables, we extrapolate the age by calculating their year of birth:
  - We extract the date of the meeting from the meetings table for each row
  - We compute the year of birth of each powerlifter from the rows where they have an age in the table and from the year of competition.
  - For each row, we fill in the approximate age of each powerlifter who has an age recorded at least once, adjusted for the year of competition
| MeetID | Name | Date | Year_of_Birth | Age | Sex | BodyweightKg | WeightClassKg | Equipment | BestSquatKg | BestBenchKg | BestDeadliftKg | Wilks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angie Belk Terry | 2016-10-29 | 1969 | 47 | F | 59.6 | 60 | Wraps | 47.6 | 20.4 | 70.3 | 155.1 |
| 0 | Dawn Bogart | 2016-10-29 | 1974 | 42 | F | 58.5 | 60 | Single-ply | 142.9 | 95.2 | 163.3 | 456.4 |
| 0 | Dawn Bogart | 2016-10-29 | 1974 | 42 | F | 58.5 | 60 | Single-ply | 142.9 | 95.2 | 163.3 | 456.4 |
| 0 | Dawn Bogart | 2016-10-29 | 1974 | 42 | F | 58.5 | 60 | Raw | NA | 95.2 | NA | 108.3 |
| 0 | Destiny Dula | 2016-10-29 | 1998 | 18 | F | 63.7 | 67.5 | Raw | NA | 31.8 | 90.7 | 130.5 |
| 0 | Courtney Norris | 2016-10-29 | 1988 | 28 | F | 62.4 | 67.5 | Wraps | 170.1 | 77.1 | 145.2 | 424.4 |
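The age-filling procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original R code: the function name `impute_ages` and the "Year" field are ours, a lifter is assumed to be uniquely identified by name, and the year of birth is taken from the first row where an age is recorded. The example rows are made up.

```python
def impute_ages(entries):
    """Fill missing ages from a year of birth inferred per lifter name.

    Each entry is a dict with "Name", "Year" (year of the meet) and
    "Age" (a number, or None when missing).
    """
    # Year of birth from the first row where the lifter has an age.
    birth_year = {}
    for entry in entries:
        if entry["Age"] is not None:
            birth_year.setdefault(entry["Name"], entry["Year"] - int(entry["Age"]))

    result = []
    for entry in entries:
        if entry["Name"] not in birth_year:
            continue  # this lifter never has an age: discard the row
        filled = dict(entry)
        filled["Year_of_Birth"] = birth_year[entry["Name"]]
        if filled["Age"] is None:
            # Approximate age, adjusted for the year of competition.
            filled["Age"] = entry["Year"] - birth_year[entry["Name"]]
        result.append(filled)
    return result


# Hypothetical rows: "Lifter A" has an age once, "Lifter B" never does.
rows = impute_ages([
    {"Name": "Lifter A", "Year": 2016, "Age": 42},
    {"Name": "Lifter A", "Year": 2018, "Age": None},
    {"Name": "Lifter B", "Year": 2017, "Age": None},
])
for row in rows:
    print(row["Name"], row["Year"], row["Year_of_Birth"], row["Age"])
```

The resulting ages are approximate by construction, since only the year of the meet is used and not the lifter's birthday.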
To link the two tables, we join them on the MeetID.
| | MeetID | MeetName | Date.x | Federation | MeetPath | Name | Date.y | Year_of_Birth | Age | Sex | BodyweightKg | WeightClassKg | Equipment | BestSquatKg | BestBenchKg | BestDeadliftKg | Wilks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Angie Belk Terry | 2016-10-29 | 1969 | 47 | F | 59.6 | 60 | Wraps | 47.6 | 20.4 | 70.3 | 155.1 |
| 2 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Dawn Bogart | 2016-10-29 | 1974 | 42 | F | 58.5 | 60 | Single-ply | 142.9 | 95.2 | 163.3 | 456.4 |
| 4 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Dawn Bogart | 2016-10-29 | 1974 | 42 | F | 58.5 | 60 | Raw | NA | 95.2 | NA | 108.3 |
| 5 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Destiny Dula | 2016-10-29 | 1998 | 18 | F | 63.7 | 67.5 | Raw | NA | 31.8 | 90.7 | 130.5 |
| 6 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Courtney Norris | 2016-10-29 | 1988 | 28 | F | 62.4 | 67.5 | Wraps | 170.1 | 77.1 | 145.2 | 424.4 |
| 7 | 0 | 2016 Junior & Senior National Powerlifting Championships | 2016-10-29 | 365Strong | 365strong/1601 | Maureen Clary | 2016-10-29 | 1956 | 60 | F | 67.3 | 67.5 | Raw | 124.7 | 95.2 | 163.3 | 392.0 |
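A join on MeetID can be sketched in Python as below. The original join was done in R and its exact type is not stated; this sketch assumes an inner join (lifter rows without a matching meet are dropped), and the function name `join_on_meet_id` and the example rows are hypothetical. Unlike the table above, which keeps both date columns as Date.x and Date.y, this sketch lets the lifter row overwrite any column name it shares with the meet row.

```python
def join_on_meet_id(lifter_rows, meet_rows):
    """Inner join: attach the meet columns to each lifter row with the same MeetID."""
    meets_by_id = {meet["MeetID"]: meet for meet in meet_rows}
    joined = []
    for row in lifter_rows:
        meet = meets_by_id.get(row["MeetID"])
        if meet is not None:
            # Meet columns first, then lifter columns (which win on name clashes).
            joined.append({**meet, **row})
    return joined


# Hypothetical rows for illustration.
lifters = [{"MeetID": 0, "Name": "Lifter A"}, {"MeetID": 9, "Name": "Lifter B"}]
meets = [{"MeetID": 0, "Federation": "365Strong", "Date": "2016-10-29"}]
print(join_on_meet_id(lifters, meets))
```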
The meetings table is also used on its own for the map visualization, whereas the powerlifters table is used on its own to link the lifters' physiques to their performances.
Both are used together to study the links between competitions and Instagram popularity, and between competitions and performances.
The Instagram account names come from the following file:
https://gitlab.com/openpowerlifting/opl-data/blob/master/lifter-data/social-instagram.csv
The 3 files with the follower counts for a subset of powerlifter profiles can be retrieved here:
- insta: https://drive.google.com/open?id=1Ce6VrLjzP4VWG9gsKXX9N2k6-zhwf2yk
- insta2: https://drive.google.com/open?id=1VOIxQXXZ2zZ5xnTIqZA8TUU0lncc3a-F
- insta3: https://drive.google.com/open?id=1UrguUyX0ezVroB-5nC9b6RdkjOu7Zagr
We found it particularly difficult to extract the data from Instagram because it does not allow us to extract more than 9000 followers every 15 minutes. Furthermore, the API we used only allowed us 10 minutes per day. This means that we can only run the API a limited number of times per day and then have to wait until the next day. As it would take a lot of time to get the number of followers for all our observations (around 3500 people), we use a subset.
Moreover, some lifters have too many followers for us to extract them all (another limit of the API, which does not allow us to extract more than 180K followers per profile).
We had to save all the followers in 3 different .csv files, which we then imported and merged. However, as the csv files were quite heavy and GitHub did not allow us to upload them, we performed the analysis locally and then uploaded the files to the drive. Then, we joined 3 different tables: one with the number of followers and the Instagram accounts, one with the Instagram accounts and the names of the lifters, and finally the powerlifter table with all the relevant variables.
OpenStreetMap
To get a geographical representation of where the powerlifting competitions took place, we use OpenStreetMap to retrieve the coordinates of the different places. Each place is queried together with its country to minimize the risk of error (e.g. Paris, TX vs. Paris, France). We then plot the coordinates on a world map and on a USA map (as this is where most competitions take place).
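To make the lookup concrete, here is a hedged Python sketch of how a "town, country" string can be turned into a geocoding request. It assumes OpenStreetMap's public Nominatim search service, which may or may not be what the original R code used; the function name `geocode_url` is ours, and the sketch only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint (assumption: the report does not name the client used).
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def geocode_url(town, country):
    """Build a Nominatim search URL for the query "town, country"."""
    query = {
        "q": f"{town.strip()}, {country.strip()}",  # the country disambiguates the town
        "format": "json",
        "limit": 1,  # keep only the best match
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(query)}"


print(geocode_url("Paris", "France"))
```

In practice it makes sense to query each distinct location only once and to pause between requests, since the public service is rate limited.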
Exploratory Analysis:
Overall population statistics
We first look at the overall population of powerlifters and compute some basic statistics about them. Namely, we look at the gender ratio, the equipment used by gender, and the best squat, best bench press and best deadlift across genders.
We also look more closely at the year 2018 to see the bodyweight and Wilks distributions.
| Name | Sex | BestSquat | BestBench | BestDeadlift |
|---|---|---|---|---|
| A Jay Montanez | M | 225.0 | 165.0 | 270.0 |
| A Yang | M | 210.0 | 110.0 | 235.0 |
| A-Yun Lin | F | 80.0 | 52.5 | 120.0 |
| A.J. Alma Flowers | M | 113.4 | 65.8 | 111.1 |
| A.J. Schroeder | M | 242.5 | 210.0 | 275.0 |
| A.T. Tarkany | M | 317.5 | 149.7 | 272.2 |
| Gender | Mean squat | Median squat | Mean benchpress | Median benchpress | Mean deadlift | Median deadlift |
|---|---|---|---|---|---|---|
| F | 115.6 | 110 | 65.5 | 62.5 | 137.0 | 135.0 |
| M | 211.1 | 205 | 143.7 | 137.5 | 234.2 | 232.5 |
This first exploratory analysis shows us that there are more than twice as many males as females, and that both display the same general patterns, with women being lighter. All the distributions we see for men appear to be repeated for women, but shifted to the left and down in each graph. Women's peak performances are therefore reached at lower bodyweights than men's. We conclude that we should be able to use similar techniques and models to analyze males and females, but that we should nonetheless separate them when analysing weights and performance.
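Wilks points and body weight
Looking at the distribution of the Wilks points per body weight and sex, we see an unusual pattern in the weights. To analyze it further, we look at the weight distributions of males and females. The two tables below list the most frequent body weights (in kg) with their frequencies; judging by the values, the first corresponds to males and the second to females.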
| Weights | Frequency |
|---|---|
| 90 | 4163 |
| 82 | 3784 |
| 74 | 3035 |
| 83 | 3030 |
| 89 | 2974 |
| 100 | 2941 |
| Weights | Frequency |
|---|---|
| 60 | 1896 |
| 56 | 1664 |
| 67 | 1561 |
| 52 | 1446 |
| 63 | 1390 |
| 66 | 1182 |
What we see here is that powerlifters tend to aim for specific weights. Indeed, they want to reach the highest possible bodyweight within a category, believing that being the heaviest in their competing group gives them an advantage. It will therefore be difficult to assess whether the competition divisions are well designed, since the lifters have adapted to this constraint. However, we can analyse the weights within each category. The problem is that the categories vary a lot from one competition to another, so we will only analyse the most frequent competition divisions.
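A frequency table like the ones above can be produced by counting lifters per whole kilogram. The Python sketch below is only an illustration: the original tables were computed in R, the function name `most_common_weights` is ours, and truncating bodyweights to the kilogram below is an assumption suggested by the integer weights shown (for example 82 and 67, just under the 82.5 kg and 67.5 kg limits). The sample weights are made up.

```python
import math
from collections import Counter


def most_common_weights(bodyweights_kg, top=6):
    """Count lifters per bodyweight truncated to whole kg, most frequent first."""
    counts = Counter(math.floor(weight) for weight in bodyweights_kg)
    return counts.most_common(top)


# Hypothetical bodyweights clustering just below the 82.5 kg and 90 kg limits.
sample = [82.3, 82.4, 89.9, 82.0, 74.8, 89.5]
print(most_common_weights(sample, top=2))  # prints [(82, 3), (89, 2)]
```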
Performances
We see that for both genders, the best squat increases with bodyweight, and that its variance also increases considerably with bodyweight. The same holds for the best bench. For the best deadlift, the increase seems less linear: especially for females, the slope of the increase in performance looks much less steep above 50 kg of bodyweight. We will need to deal with the increasing variance when modeling.
Federations
We want to see where most competitions take place:
We see here that most competitions happen in the USA and in Europe. We decided not to include an interactive map because there are not enough observations to represent (only 999 competitions). Each competition is represented by a point on the maps. This representation answers question 7) above.
To get a first look at the different federations, we run an exploratory analysis: we look at the most represented federations, as well as the mean best squat, best bench, best deadlift and Wilks for each federation.
The first thing we see is that there is only one federation in which not all three exercises were practiced: APC.
It is therefore interesting to check how well represented this federation is.
| Federations | Frequency |
|---|---|
| 365Strong | 675 |
| AAPF | 122 |
| AAU | 207 |
| APA | 577 |
| APC | 11 |
| APF | 3195 |
| BB | 329 |
| BPU | 58 |
| CAPO | 280 |
| CommonwealthPF | 193 |
| CPF | 1619 |
| CPL | 445 |
| EPA | 28 |
| EPF | 1737 |
| FESUPO | 138 |
| FFForce | 530 |
| FPO | 29 |
| GPA | 1088 |
| GPC | 1620 |
| GPC-AUS | 1061 |
| GPC-GB | 307 |
| HERC | 118 |
| IPA | 1509 |
| IPF | 6924 |
| IPL | 3321 |
| MM | 40 |
| NAPF | 926 |
| NASA | 1 |
| NPA | 45 |
| NZPF | 1482 |
| OceaniaPF | 315 |
| PA | 7256 |
| ProRaw | 166 |
| RAW | 619 |
| RPS | 115 |
| SPF | 4194 |
| UPA | 2848 |
| USAPL | 24455 |
| USPA | 49416 |
| USPF | 2489 |
| WNPF | 111 |
| WPC | 1840 |
| WRPF | 3196 |
| WRPF-AUS | 39 |
However, we see that APC is heavily underrepresented (11 entries) and might not be representative at all. The most represented federations are IPF, PA, USAPL and USPA, so a table restricted to these federations might be more informative about what happens to scores on a large scale.
We see that IPF has the highest mean Wilks, as well as the highest squat, bench and deadlift. It can be thought of as a popular federation that is still more “elite”. In general, we cannot spot any specific attraction of a type of lifter to a federation, and the differences might come from the geographical location of the competitions. This would be consistent with the previous map, with the USA hosting the most popular federations.
Lifters who competed a lot
Below, we retrieve the powerlifters who competed the most, with their number of participations.
For each of the 5 athletes who competed the most, we compute basic statistics. Because they did not compete at the same time, at the same frequency or over the same period, it is very difficult to compare them and build a model from them.
We can however analyse them individually and see whether any pattern stands out as expected or unexpected.
| Name | Number of competitions | Number of federations | Years competed | First year of competition | Last year of competition | Age for first competition | Age for last competition |
|---|---|---|---|---|---|---|---|
| Alan Aerts | 81 | 3 | 9 | 2006 | 2015 | 50.0 | 59.0 |
| Betsy Spann | 60 | 3 | 7 | 2011 | 2017 | 54.0 | 60.0 |
| April Shumaker | 41 | 3 | 8 | 2009 | 2017 | 42.5 | 50.5 |
| Nicki I’Anson | 56 | 3 | 12 | 2004 | 2017 | 38.0 | 51.0 |
| Bonnie Aerts | 51 | 3 | 7 | 2006 | 2014 | 47.0 | 55.0 |
We can see above that the Wilks points are the most unpredictable, and that each athlete shows a different evolution of performance for each exercise. Betsy Spann's best squat, best deadlift and best bench evolved in a similar way, with a peak around 2016; however, her Wilks points did not follow this pattern. Moreover, we did not examine the residuals of the smoothed curves to check whether this representation is statistically meaningful, which is a big limitation of this analysis. April Shumaker seems to show a dip in performance in 2014, and Alan Aerts stopped squats and deadlifts quite early. Nicki I’Anson shows a very clear progression in deadlifts, but the picture is less clear for the other exercises. Finally, Bonnie Aerts only did deadlifts from 2010 on. Overall, it is difficult to draw any conclusion about the athletes who competed the most: we do not have enough data on each one to make a relevant analysis.
Then, we looked graphically at whether there is any relationship between the number of followers and the best squat, best bench and best deadlift, as well as the Wilks coefficient (relative strength). We find that most “average” powerlifters have fewer than 5000 followers on Instagram, with some exceptions. However, past a certain threshold, very good lifters seem to have many more followers, probably because of their outstanding capacity. This is particularly noticeable for the Wilks coefficient (above 500).
Strength and bodyweight
For the Wilks points per weight, we see the following distributions for males and females:
Here we see that Wilks points seem to increase with weight for males, but the increase is not as dramatic as the decrease due to age, and the relation is not clear-cut. The pattern in the boxplot is interesting, as it shows indentations in the median, suggesting that the hypothesis "the heavier the better within one's category" seems to hold up.
For females, we see an increase with weight up to 50 kg, where there is a dip. The indentations of performance as a function of weight are less pronounced. This suggests that maximizing one's weight within a competition division is not as important for females as it is for males.
Modeling
Male Wilks points depending on age
First we try to fit a linear model, although it is unlikely to be the best choice given the shape we observed in the exploratory analysis.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 420.5 | 1.1 | 366.8 | 0 |
| Age | -3.2 | 0.0 | -95.9 | 0 |
We then try a polynomial model of degree 6:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 318.1 | 0.4 | 767.5 | 0.0 |
| poly(Age, 6)1 | -11572.1 | 119.9 | -96.5 | 0.0 |
| poly(Age, 6)2 | -2898.9 | 119.9 | -24.2 | 0.0 |
| poly(Age, 6)3 | 2335.8 | 119.9 | 19.5 | 0.0 |
| poly(Age, 6)4 | -1608.9 | 119.9 | -13.4 | 0.0 |
| poly(Age, 6)5 | 232.0 | 119.9 | 1.9 | 0.1 |
| poly(Age, 6)6 | 136.6 | 119.9 | 1.1 | 0.3 |
This is a much better model, as it closely follows the pattern of the mean Wilks points at a given age. There might be some overfitting: the degree 5 and degree 6 terms have p-values of 0.1 and 0.3. Also, because the model predicts the mean, it yields a single expected value per age rather than a range for an individual lifter. A further interesting analysis would be that of the residuals.
Female Wilks points depending on age
We again try to fit a linear model for females.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 375.1 | 1.7 | 218.2 | 0 |
| Age | -2.4 | 0.1 | -47.7 | 0 |
We see here that the linear model fits the females better than the males, but it is still not ideal. Given the shape we saw in the exploratory analysis, it makes sense to try a polynomial model, here of degree 5.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 297.8 | 0.6 | 520.9 | 0.0 |
| poly(Age, 5)1 | -5137.6 | 107.5 | -47.8 | 0.0 |
| poly(Age, 5)2 | -1531.7 | 107.5 | -14.3 | 0.0 |
| poly(Age, 5)3 | 311.8 | 107.5 | 2.9 | 0.0 |
| poly(Age, 5)4 | -162.7 | 107.5 | -1.5 | 0.1 |
| poly(Age, 5)5 | 17.2 | 107.5 | 0.2 | 0.9 |
This is a much better model, as it closely follows the pattern of the mean Wilks points at a given age. There might be some overfitting: the degree 4 and degree 5 terms have p-values of 0.1 and 0.9. Also, because the model predicts the mean, it yields a single expected value per age rather than a range for an individual lifter. Fortunately, the mean and the median of the Wilks points are close enough.
Prediction of performances
We first select the variables of interest to predict the one-repetition maximum (1RM) of the squat, bench press and deadlift, namely age, sex, bodyweight and the Wilks coefficient. Then, using the caret package, we create a training set with 75% of the observations and a test set with the remaining 25%. As we run a linear regression, we test whether any of our predictors suffers from multicollinearity, using the variance inflation factor (VIF). No variable has a VIF above 5, so multicollinearity is not a concern. We then fit a linear model for each outcome of interest: BestSquatKg, BestBenchKg and BestDeadliftKg. The R squared values of the three regressions lie between 0.88 and 0.92, so our predictive capability is high but not perfect. We decided to include the Wilks coefficient because without it we could only explain 56% of the variance.
We also test another class of models, Generalized Additive Models (GAM). To document the quality of our predictions, we pasted the output, including the scoring measures, as a comment instead of rerunning the models; this shortens the rendering of the R Markdown document, which would otherwise take much longer to compute. When we compare the two models, the GAM is better: its RMSE is lower for each outcome variable. Furthermore, we validated our predictions with 2-fold cross-validation. We are aware that we could have tested more models, such as random forests, neural networks or SVMs, and that we should have tuned the parameters of each of them. However, this would have required more time than the class allowed, as well as more computational power. Overall, even with only two models tested, we find satisfactory predictive capabilities.
## Wilks Sex BodyweightKg Age
## 1.105602 1.412836 1.368950 1.037418
## RMSE Rsquared MAE
## 18.5034736 0.9272838 13.6648145
## RMSE Rsquared MAE
## 19.1232242 0.9023799 14.1584811
## RMSE Rsquared MAE
## 16.7626677 0.8908911 12.2687903
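For readers unfamiliar with the three scores printed above, the following Python sketch shows how they can be computed from observed and predicted values. It is an illustration rather than the original code: the scores in the report come from R, the function name `regression_scores` is ours, and R squared is assumed here to be the squared correlation between observations and predictions, which is how caret's `postResample` defines it. The example numbers are made up.

```python
import math


def regression_scores(actual, predicted):
    """Return RMSE, Rsquared (squared correlation) and MAE for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error
    mae = sum(abs(e) for e in errors) / n             # mean absolute error

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r_squared = cov ** 2 / (var_a * var_p)            # squared Pearson correlation
    return {"RMSE": rmse, "Rsquared": r_squared, "MAE": mae}


# Hypothetical best squats (kg) and model predictions.
scores = regression_scores([100, 150, 200, 250], [110, 140, 210, 240])
print(scores)
```

RMSE and MAE are expressed in kilograms, so an RMSE around 17 to 19 means that predictions are typically off by that order of magnitude.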
Final Analysis
Below are the final answers to our questions:
Question 1
1) To graphically summarize the population of powerlifters and draw meaningful conclusions, using features such as gender, bodyweight, height and age, and to try to predict performance (best squat, best deadlift and best bench press).
In the exploratory analysis we saw that there are twice as many males as females and that, overall, the performances of the females follow the same patterns as those of the males, but with lower results in general, as they are also lighter. We saw the distribution of bodyweights, with more people at the limit of each competition division. We did not have any data on the height of the lifters.
As for the predictions, using 4 features, namely age, gender, bodyweight and the Wilks coefficient, a linear model already gives good accuracy (around 90% of the variance is explained by our model), with an RMSE in the range of 16 to 18. We also use a GAM, which gives somewhat more precision at the cost of greater complexity. Overall, even with a simple model, we already have good predictive capability.
Question 2
2) It would also be interesting to measure the physical statistics of the lifters (Age, Sex, Weight) and see if there is a link with their Wilks points.
We saw that the mean Wilks score depends strongly on the age of the lifters. We modelled this with polynomial models for both females and males, but with different degrees. Both exhibit a peak around the age of 25 with a gradual decrease afterwards.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 318.1 | 0.4 | 767.5 | 0.0 |
| poly(Age, 6)1 | -11572.1 | 119.9 | -96.5 | 0.0 |
| poly(Age, 6)2 | -2898.9 | 119.9 | -24.2 | 0.0 |
| poly(Age, 6)3 | 2335.8 | 119.9 | 19.5 | 0.0 |
| poly(Age, 6)4 | -1608.9 | 119.9 | -13.4 | 0.0 |
| poly(Age, 6)5 | 232.0 | 119.9 | 1.9 | 0.1 |
| poly(Age, 6)6 | 136.6 | 119.9 | 1.1 | 0.3 |
A strength of the model is its simplicity. A weakness is that we did not analyse the residuals and that we do not have confidence intervals around the fitted curve. Moreover, we predict the mean Wilks for each age and not the Wilks of each individual lifter, so intervals would be useful to interpret where an individual could land.
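As a minimal sketch of the intervals mentioned above, base R's `predict()` for `lm` objects can return both a confidence interval for the mean and a wider prediction interval for a single observation. The data here are synthetic, and the model is fitted on individual rows rather than on the per-age means used in the report, which is an assumption made so that the prediction interval is meaningful.

```r
set.seed(2)
n <- 500
# Synthetic ages and Wilks scores peaking in the mid-twenties (made-up numbers)
lifters <- data.frame(Age = runif(n, 15, 70))
lifters$Wilks <- 330 - 0.08 * (lifters$Age - 25)^2 + rnorm(n, sd = 40)

# Same model form as in the table above: a degree-6 orthogonal polynomial in Age
fit <- lm(Wilks ~ poly(Age, 6), data = lifters)

new_ages <- data.frame(Age = c(20, 25, 40, 60))

# Interval for the mean Wilks at each age
mean_band <- predict(fit, newdata = new_ages,
                     interval = "confidence", level = 0.95)
# Wider interval for where a single lifter of that age could land
indiv_band <- predict(fit, newdata = new_ages,
                      interval = "prediction", level = 0.95)

print(cbind(new_ages, mean_band))
print(cbind(new_ages, indiv_band))
```

Both results have the columns `fit`, `lwr` and `upr`. The prediction interval is always wider than the confidence interval because it adds the residual variance of individual lifters to the uncertainty of the fitted mean.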
We also looked at the relation between bodyweight and Wilks points. There is no clear relation to model, but for males it seems that the heavier a lifter is within his division, the higher the Wilks, whereas for females this relation is not so clear.
Question 3
3) Is it really an advantage to be the heaviest in one competing division? More precisely, in the most popular competing divisions, as they vary a lot from one competition/federation to another.
As we saw in the answer to question 2, for males it seems beneficial for their Wilks points to be the heaviest in their category, but it is not so clear for females. Unfortunately, we did not have enough time to explore this question more thoroughly with statistical tests. The answer is therefore merely descriptive.
Question 4
4) We wanted to make a link between federations and exercise performances, and see if some federations attract a certain type of powerlifter.
We saw no clear relation between the federations and the exercise performances, which suggests that lifters probably choose their federation with geographical convenience in mind.
The most popular federations seem to be in the US, the USPA in particular:
| Federations | Frequency |
|---|---|
| 365Strong | 675 |
| AAPF | 122 |
| AAU | 207 |
| APA | 577 |
| APC | 11 |
| APF | 3195 |
| BB | 329 |
| BPU | 58 |
| CAPO | 280 |
| CommonwealthPF | 193 |
| CPF | 1619 |
| CPL | 445 |
| EPA | 28 |
| EPF | 1737 |
| FESUPO | 138 |
| FFForce | 530 |
| FPO | 29 |
| GPA | 1088 |
| GPC | 1620 |
| GPC-AUS | 1061 |
| GPC-GB | 307 |
| HERC | 118 |
| IPA | 1509 |
| IPF | 6924 |
| IPL | 3321 |
| MM | 40 |
| NAPF | 926 |
| NASA | 1 |
| NPA | 45 |
| NZPF | 1482 |
| OceaniaPF | 315 |
| PA | 7256 |
| ProRaw | 166 |
| RAW | 619 |
| RPS | 115 |
| SPF | 4194 |
| UPA | 2848 |
| USAPL | 24455 |
| USPA | 49416 |
| USPF | 2489 |
| WNPF | 111 |
| WPC | 1840 |
| WRPF | 3196 |
| WRPF-AUS | 39 |
The only real outlier in performance seems to be FPO, but it only has 29 participants, which might not be very representative.
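A frequency table like the one above, together with a check for federations that have too few entries to support conclusions, can be sketched as follows. The data frame `entries`, its contents and the threshold `min_entries` are hypothetical and only illustrate the idea.

```r
# Hypothetical stand-in for the competition entries; only Federation matters here
entries <- data.frame(
  Federation = c("USPA", "USAPL", "USPA", "IPF", "FPO", "USPA", "USAPL", "PA")
)

# Count entries per federation, most frequent first
counts <- sort(table(entries$Federation), decreasing = TRUE)
federation_freq <- data.frame(
  Federations = names(counts),
  Frequency = as.integer(counts),
  Share = round(as.integer(counts) / sum(counts), 3)
)
print(federation_freq)

# Flag federations with too few entries to be representative
# (the threshold is arbitrary and chosen for this toy example)
min_entries <- 2
small <- federation_freq$Federations[federation_freq$Frequency < min_entries]
print(small)
```

Adding the share next to the raw count makes it easier to see how much of the dataset a single federation accounts for before comparing performances across federations.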
Question 5
5) Then, we would also analyze the evolutions of the overall performances through time for a given lifter and see if any unexpected (or expected) pattern occurs. We would select lifters that have more than a certain amount of entries, but we have not yet decided this benchmark.
For each of the 5 athletes who competed the most, we computed basic statistics. Because they did not compete at the same time, at the same frequency, or over the same time period, it is really difficult to compare them and to build a model from them.
We can however analyse each of them individually and see if anything seems familiar, illogical or logical.
As we said in the exploratory analysis, the Wilks points are the most unpredictable, and each athlete's performance evolves differently for each exercise. Betsy Spann has a best squat, best deadlift and best bench that evolved in a similar way, with a peak around 2016. However, her Wilks points did not reflect this pattern. Moreover, we did not examine the residuals of the smooth fitted to each curve to see if this representation is statistically sound. This is a major limitation of this analysis. April Shumaker seems to show a dip in performance in 2014, and Alan Aerts stopped competing in the squat and deadlift quite quickly. Nicky I’Anson has a very clear progression in deadlifts but it is not so clear for the other exercises represented, and finally, Bonnie Aerts only did deadlifts from 2010 on. Overall, it is difficult to draw any conclusion about the athletes who competed the most. We do not have enough data on each one to make a relevant analysis here.
Question 6
6) It would be interesting to extract the number of Instagram followers per powerlifter, and see if we can draw conclusions between their popularity, performances, federations and exercises.
We conclude that most “average” powerlifters have fewer than 5000 followers on Instagram. There are, of course, exceptions. However, past a certain performance threshold, very good lifters seem to have many more followers, probably because of their outstanding ability. This is particularly noticeable if we consider the Wilks coefficient (above roughly 500).
Question 7
7) Finally, we wanted to make a visual representation in the form of a map of the different powerlifting contests. In one of the databases we have the places of the competitions. With the help of OpenStreetMap, we thought of extracting the coordinates of those competitions and representing them as points on a world map. It would allow us to see in a visual manner the density of contests and the most popular places/countries to compete in.
We see here that most competitions happen in the USA and in Europe. We decided not to include an interactive map because there are not enough observations (only 999 competitions) to represent. Each competition is represented by a point on the maps. This representation answers question 7.